Webscraping with tidyverse Packages


Sam Tyner
(co-organizer R-Ladies Ames)

9 Feb 2017

Outline

  1. Introduction
    • What is webscraping?
    • Why webscrape?
  2. Webscraping in R
    • Available packages (other than tidyverse)
  3. The tidyverse?
  4. rvest quick start guide
    • Your Turn #1
  5. Deeper dive into rvest
    • Key functions
    • Your Turn #2
  6. Advanced Examples

Introduction

What is webscraping?

  • Extract data from websites
    • Tables
    • Links to other websites
    • Text

Why webscrape?

  • Because copy-paste is awful
  • Because it’s fast
  • Because you can automate it

Resources for Webscraping in R

R Packages for Webscraping

Lots to choose from: XML, XML2R, scrapeR, selectr, rjson, RSelenium, etc.

Many more (and links to the above) on the Web Technologies CRAN Task View

But, we’ll be using the tidyverse packages rvest and xml2

What is the tidyverse?

The Tidy Tools Manifesto

“The tidyverse is a set of packages that work in harmony…. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.” - RStudio Blog

  1. Reuse existing data structures. (i.e. stick with data frames!)
  2. Compose simple functions with the pipe. (Each function does one simple thing well.)
  3. Embrace functional programming. (OOPers may find this difficult. If you are totally lost, you’ll be fine.)
  4. Design for humans. (Code should be understood by humans first, then computers)
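As a tiny illustration of principles 1 and 2, here is a sketch using the built-in mtcars data frame: each function does one simple thing, the pipe composes them, and the data stays a data frame throughout.

```r
library(tidyverse)

# stick with data frames; compose simple verbs with the pipe
mtcars %>%
  filter(cyl == 4) %>%              # keep only 4-cylinder cars
  group_by(gear) %>%                # one group per number of gears
  summarise(avg_mpg = mean(mpg))    # one simple summary per group
```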

Familiar Friends

You may already have used:

  • ggplot2 for visualization
  • dplyr for data manipulation
  • tidyr for data tidying

Install all tidyverse packages in one fell swoop:

# check if you already have it
library(tidyverse)
# if not:
install.packages("tidyverse")
library(tidyverse) # only calls the "core" of tidyverse

tidyverse packages for web data

  • httr: for web APIs (Application Programming Interface)
  • jsonlite: for JSON (JavaScript Object Notation) data from the web
  • xml2: for XML (eXtensible Markup Language) structured data
  • rvest: package of wrapper functions to xml2 and httr for easy web scraping

We’ll focus on rvest
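To give a rough sense of where each package fits, a minimal sketch (the API endpoint below is hypothetical, for illustration only):

```r
library(httr)     # talk to web APIs
library(jsonlite) # parse JSON
library(xml2)     # parse XML
library(rvest)    # scrape HTML (wraps xml2 and httr)

# httr + jsonlite: request a (hypothetical) API, parse the JSON response
resp <- GET("https://api.example.com/films?title=Moonlight")
dat  <- fromJSON(content(resp, as = "text"))

# xml2: parse XML-structured data
doc <- read_xml("<films><film>Moonlight</film></films>")

# rvest: read an HTML page for scraping
page <- read_html("http://www.imdb.com/title/tt4975722/")
```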

Webscraping with rvest: Step-by-Step Start Guide

Step 1: Find a URL

What data do you want?

  • Information on Oscar-nominated film Moonlight

Find it on the web!

# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"

Step 2: Read HTML into R

“Huh? What am I doing?” - some of you right now

  • HTML is HyperText Markup Language. All webpages are written with it.
  • Go to any website, right click, click “View Page Source” to see the HTML
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n    if (typeof ue ...

Step 3: Figure out where your data is

Need to find your data within the myhtml object.

Tags to look for:

  • <p>: paragraphs
  • <h1>, <h2>, etc.: headers
  • <a>: links
  • <li>: item in a list
  • <table>: tables

Use SelectorGadget to find the exact location. (Demo)

For more on HTML, I recommend W3Schools’ tutorial. You don’t need to be an expert in HTML to webscrape with rvest!
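To see how these tags map to rvest calls, here is a self-contained sketch that scrapes an HTML string instead of a live page:

```r
library(rvest)

html <- '<html><body>
  <h1>Moonlight</h1>
  <p>A film by <a href="/name/barry-jenkins">Barry Jenkins</a></p>
  <li>Drama</li>
</body></html>'

page <- read_html(html)

page %>% html_nodes("h1") %>% html_text()        # headers
page %>% html_nodes("p")  %>% html_text()        # paragraph text
page %>% html_nodes("a")  %>% html_attr("href")  # link targets
page %>% html_nodes("li") %>% html_text()        # list items
```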

Step 4: Tell rvest where to find your data

Copy-paste the selector from SelectorGadget, or pass HTML tags, to html_nodes() to extract your data of interest

myhtml %>% html_nodes(".summary_text") %>% html_text()
## [1] "\n                    A timeless story of human self-discovery and connection, Moonlight chronicles the life of a young black man from childhood to adulthood as he struggles to find his place in the world while growing up in a rough neighborhood of Miami.\n            "
myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
##    Cast overview, first billed only: Cast overview, first billed only:
## 1                                 NA                    Mahershala Ali
## 2                                 NA                      Shariff Earp
## 3                                 NA                    Duan Sanderson
## 4                                 NA                   Alex R. Hibbert
## 5                                 NA                     Janelle Monáe
## 6                                 NA                     Naomie Harris
## 7                                 NA                       Jaden Piner
## 8                                 NA            Herman 'Caheei McGloun
## 9                                 NA                  Kamal Ani-Bellow
## 10                                NA                      Keomi Givens
## 11                                NA                   Eddie Blanchard
## 12                                NA                       Rudi Goblen
## 13                                NA                    Ashton Sanders
## 14                                NA                        Edson Jean
## 15                                NA                    Patrick Decile
##    Cast overview, first billed only:
## 1                                ...
## 2                                ...
## 3                                ...
## 4                                ...
## 5                                ...
## 6                                ...
## 7                                ...
## 8                                ...
## 9                                ...
## 10                               ...
## 11                               ...
## 12                               ...
## 13                               ...
## 14                               ...
## 15                               ...
##                        Cast overview, first billed only:
## 1                                                   Juan
## 2                                               Terrence
## 3            Azu \n  \n  \n  (as Duan 'Sandy' Sanderson)
## 4                   Little \n  \n  \n  (as Alex Hibbert)
## 5                                                 Teresa
## 6                                                  Paula
## 7                                              Kevin (9)
## 8  Longshoreman \n  \n  \n  (as Herman 'Caheej' McCloun)
## 9                                         Portable Boy 1
## 10                                        Portable Boy 2
## 11                                        Portable Boy 3
## 12                                                   Gee
## 13                                                Chiron
## 14                                            Mr. Pierce
## 15                                                Terrel
## 
## [[2]]
##   Straight blk male friends won't see it because they think its a gay film
## 1                                               My problem with this movie
## 2                                Most overrated movie of the Year - BORING
## 3                                              What happened to the bully?
## 4                                                                      Meh
## 5      Do you really think Mahershala Ali deserves the hype for this role?
##   cliffcarson-502-470231
## 1                s_a-k_y
## 2         JSchoenleber70
## 3              cuterstar
## 4          jamesforsythe
## 5            jvcksonsmth
## 
## [[3]]
##                     Amazon Affiliates
## 1 Amazon VideoWatch Movies &TV Online
##                              Amazon Affiliates
## 1 Prime VideoUnlimited Streamingof Movies & TV
##                          Amazon Affiliates
## 1 Amazon GermanyBuy Movies onDVD & Blu-ray
##                        Amazon Affiliates
## 1 Amazon ItalyBuy Movies onDVD & Blu-ray
##                         Amazon Affiliates
## 1 Amazon FranceBuy Movies onDVD & Blu-ray
##                       Amazon Affiliates          Amazon Affiliates
## 1 Amazon IndiaBuy Movie andTV Show DVDs DPReviewDigitalPhotography
##            Amazon Affiliates
## 1 AudibleDownloadAudio Books

Step 5: Save & tidy data

library(stringr)
library(magrittr)
mydat <- myhtml %>% 
  html_nodes("table") %>%
  extract2(1) %>% 
  html_table(header = TRUE)
mydat <- mydat[,c(2,4)]
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>% 
  mutate(Role = str_replace_all(Role, "\n  ", ""))
mydat
##                     Actor                                      Role
## 1          Mahershala Ali                                      Juan
## 2            Shariff Earp                                  Terrence
## 3          Duan Sanderson           Azu (as Duan 'Sandy' Sanderson)
## 4         Alex R. Hibbert                  Little (as Alex Hibbert)
## 5           Janelle Monáe                                    Teresa
## 6           Naomie Harris                                     Paula
## 7             Jaden Piner                                 Kevin (9)
## 8  Herman 'Caheei McGloun Longshoreman (as Herman 'Caheej' McCloun)
## 9        Kamal Ani-Bellow                            Portable Boy 1
## 10           Keomi Givens                            Portable Boy 2
## 11        Eddie Blanchard                            Portable Boy 3
## 12            Rudi Goblen                                       Gee
## 13         Ashton Sanders                                    Chiron
## 14             Edson Jean                                Mr. Pierce
## 15         Patrick Decile                                    Terrel

Your Turn #1

Using rvest, scrape a table from Wikipedia. You can pick your own table or you can get one of the tables in the country GDP per capita example from earlier.

Your result should be a data frame with one observation per row and one variable per column.

Your Turn #1 Solution

library(rvest)
library(magrittr)  # for extract2()
library(tidyverse) # for mutate() and parse_number()
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>% 
 html_nodes("table") %>%
 extract2(2) %>%
 html_table(header = TRUE) %>% 
 mutate(`Int$` = parse_number(`Int$`)) %>% 
 head
##   Rank           Country   Int$
## 1    1             Qatar 129727
## 2    2        Luxembourg 101936
## 3    3             Macau  96148
## 4    4         Singapore  87082
## 5    5 Brunei Darussalam  79711
## 6    6            Kuwait  71264

Deeper dive into rvest

Key Functions: html_nodes

  • html_nodes(x, "path") extracts all elements from the page x that have the tag / class / id path. (Use SelectorGadget to determine path.)
  • html_node() does the same thing but only returns the first matching element.
  • Can be chained
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") # then get all the links in those paragraphs
## {xml_nodeset (22)}
##  [1] <a href="/wiki/Purchasing_power_parity" title="Purchasing power par ...
##  [2] <a href="/wiki/Goods_and_services" title="Goods and services">goods ...
##  [3] <a href="/wiki/Gross_domestic_product" title="Gross domestic produc ...
##  [4] <a href="/wiki/Per_capita" title="Per capita">per capita</a>
##  [5] <a href="/wiki/International_Monetary_Fund" title="International Mo ...
##  [6] <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
##  [7] <a href="/wiki/National_wealth" title="National wealth">national we ...
##  [8] <a href="/wiki/Savings" class="mw-redirect" title="Savings">savings ...
##  [9] <a href="/wiki/Cost_of_living" title="Cost of living">cost of livin ...
## [10] <a href="/wiki/List_of_countries_by_GDP_(nominal)_per_capita" title ...
## [11] <a href="https://en.wiktionary.org/wiki/generalized" class="extiw"  ...
## [12] <a href="/wiki/Living_standards" class="mw-redirect" title="Living  ...
## [13] <a href="/wiki/Inflation_rates" class="mw-redirect" title="Inflatio ...
## [14] <a href="/wiki/Exchange_rates" class="mw-redirect" title="Exchange  ...
## [15] <a href="#cite_note-2">[2]</a>
## [16] <a href="#cite_note-3">[3]</a>
## [17] <a href="/wiki/Personal_income" title="Personal income">personal in ...
## [18] <a href="/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_W ...
## [19] <a href="/wiki/Economy" title="Economy">economies</a>
## [20] <a href="/wiki/Sovereign_state" title="Sovereign state">sovereign s ...
## ...

Key Functions: html_text

  • html_text(x) extracts all text from the nodeset x
  • Good for cleaning output
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only 
##  [1] "purchasing power parity"                      
##  [2] "goods and services"                           
##  [3] "gross domestic product"                       
##  [4] "per capita"                                   
##  [5] "International Monetary Fund"                  
##  [6] "World Bank"                                   
##  [7] "national wealth"                              
##  [8] "savings"                                      
##  [9] "cost of living"                               
## [10] "List of countries by GDP (nominal) per capita"
## [11] "generalized"                                  
## [12] "living standards"                             
## [13] "inflation rates"                              
## [14] "exchange rates"                               
## [15] "[2]"                                          
## [16] "[3]"                                          
## [17] "personal income"                              
## [18] "Standard of living and GDP"                   
## [19] "economies"                                    
## [20] "sovereign states"                             
## [21] "dependent territories"                        
## [22] "Geary–Khamis dollars"

Key Functions: html_table

  • html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
  • Structure of HTML makes finding and extracting tables easy!
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  head(2) # look at first 2
## {xml_nodeset (2)}
## [1] <table style="font-size:95%;">\n<tr>\n<td width="33%" align="center" ...
## [2] <table class="wikitable sortable" style="margin-left:auto;margin-rig ...
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  extract2(2) %>% # pick the second one to parse
  html_table(header = TRUE) # parse table 
##     Rank                          Country    Int$
## 1      1                            Qatar 129,727
## 2      2                       Luxembourg 101,936
## 3      3                            Macau  96,148
## 4      4                        Singapore  87,082
## 5      5                Brunei Darussalam  79,711
## 6      6                           Kuwait  71,264
## 7      7                          Ireland  69,375
## 8      8                           Norway  69,296
## 9      9             United Arab Emirates  67,696
## 10    10                     Saudi Arabia  65,000
## 11    11                       San Marino  64,443
## 12    12                      Switzerland  59,376
## 13    13                        Hong Kong  58,095
## 14    14                    United States  57,294
## 15    15                      Netherlands  50,846
## 16    16                          Bahrain  50,303
## 17    17                           Sweden  49,678
## 18    18                        Australia  48,806
## 19    19                          Germany  48,190
## 20    20                          Iceland  48,070
## 21    21                          Austria  47,856
## 22    22                           Taiwan  47,790
## 23    23                          Denmark  46,603
## 24    24                           Canada  46,240
## 25    25                          Belgium  44,881
## 26    26                             Oman  43,737
## 27    27                   United Kingdom  42,514
## 28    28                           France  42,384
## 29    29                          Finland  41,813
## 30    30                            Japan  38,894
## 31    31                Equatorial Guinea  38,699
## 32    32                      South Korea  37,948
## 33    33                            Malta  37,891
## 34    34                      Puerto Rico  37,723
## 35    35                      New Zealand  37,108
## 36    36                            Spain  36,451
## 37    37                            Italy  36,313
## 38    38                           Israel  34,834
## 39    39                           Cyprus  34,387
## 40    40                   Czech Republic  33,223
## 41    41                         Slovenia  32,028
## 42    42              Trinidad and Tobago  31,934
## 43    43                  Slovak Republic  31,182
## 44    44                        Lithuania  29,882
## 45    45                          Estonia  29,502
## 46    46                         Portugal  28,515
## 47    47                       Seychelles  28,148
## 48    48                           Poland  27,715
## 49    49                         Malaysia  27,234
## 50    50                          Hungary  27,211
## 51    51                           Greece  26,809
## 52    52                           Russia  26,109
## 53    53                           Latvia  25,740
## 54    54                       Kazakhstan  25,669
## 55    55              St. Kitts and Nevis  25,372
## 56    56                      The Bahamas  24,618
## 57    57              Antigua and Barbuda  24,050
## 58    58                            Chile  23,969
## 59    59                           Panama  22,788
## 60    60                          Croatia  22,415
## 61    61                          Romania  22,319
## 62    62                          Uruguay  21,570
## 63    63                           Turkey  21,147
## 64    64                        Mauritius  20,525
## 65    65                        Argentina  20,171
## 66    66                         Bulgaria  20,116
## 67    67                            Gabon  19,252
## 68    68                           Mexico  18,865
## 69    69                          Lebanon  18,524
## 70    70                             Iran  18,136
## 71    71                       Azerbaijan  17,688
## 72    72                          Belarus  17,497
## 73    73                     Turkmenistan  17,347
## 74    74                         Barbados  17,137
## 75    75                       Montenegro  17,035
## 76    76                         Botswana  16,948
## 77    77                         Thailand  16,835
## 78    78                             Iraq  16,544
## 79    79                       Costa Rica  16,142
## 80    80               Dominican Republic  15,946
## 81    81                            China  15,424
## 82    82                         Maldives  15,288
## 83    83                            Palau  15,260
## 84    84                           Brazil  15,211
## 85    85                         Suriname  15,180
## 86    86                        Venezuela  15,103
## 87    87                          Algeria  14,950
## 88    88                        Macedonia  14,530
## 89    89                            Libya  14,236
## 90    90                           Serbia  14,226
## 91    91                         Colombia  14,162
## 92    92                          Grenada  14,102
## 93    93                     South Africa  13,179
## 94    94                             Peru  13,019
## 95    95                         Mongolia  12,161
## 96    96                            Egypt  12,137
## 97    97                        St. Lucia  11,970
## 98    98                          Albania  11,861
## 99    99                          Namibia  11,756
## 100  100                        Indonesia  11,699
## 101  101                          Tunisia  11,657
## 102  102                         Dominica  11,484
## 103  103   St. Vincent and the Grenadines  11,267
## 104  104                        Sri Lanka  11,189
## 105  105                           Jordan  11,125
## 106  106                          Ecuador  11,037
## 107  107           Bosnia and Herzegovina  11,034
## 108  108                          Georgia  10,100
## 109  109                        Swaziland   9,768
## 110  110                         Paraguay   9,354
## 111  111                             Fiji   9,353
## 112  112                          Jamaica   8,974
## 113  113                      El Salvador   8,914
## 114  114                          Armenia   8,881
## 115  115                          Morocco   8,360
## 116  116                          Ukraine   8,230
## 117  117                           Belize   8,186
## 118  118                           Bhutan   8,129
## 119  119                        Guatemala   7,937
## 120  120                           Guyana   7,920
## 121  121                      Philippines   7,696
## 122  122                          Bolivia   7,191
## 123  123                           Angola   6,844
## 124  124                Republic of Congo   6,787
## 125  125                       Cabo Verde   6,744
## 126  126                            India   6,658
## 127  127                       Uzbekistan   6,453
## 128  128                          Vietnam   6,422
## 129  129                          Myanmar   5,953
## 130  130                          Nigeria   5,930
## 131  131                             Laos   5,719
## 132  132                            Samoa   5,369
## 133  133                            Tonga   5,332
## 134  134                        Nicaragua   5,280
## 135  135                         Honduras   5,264
## 136  136                          Moldova   5,218
## 137  137                         Pakistan   5,120
## 138  138                            Sudan   4,452
## 139  139                       Mauritania   4,405
## 140  140                            Ghana   4,381
## 141  141                      Timor-Leste   4,186
## 142  142                           Zambia   3,899
## 143  143                       Bangladesh   3,891
## 144  144                         Cambodia   3,736
## 145  145                    Côte d'Ivoire   3,581
## 146  146                           Tuvalu   3,567
## 147  147                 Papua New Guinea   3,542
## 148  148                  Kyrgyz Republic   3,467
## 149  149                         Djibouti   3,370
## 150  150                            Kenya   3,360
## 151  151            São Tomé and Príncipe   3,344
## 152  152                         Cameroon   3,261
## 153  153                 Marshall Islands   3,240
## 154  154                          Lesotho   3,107
## 155  155                         Tanzania   3,097
## 156  156                       Micronesia   3,033
## 157  157                       Tajikistan   2,982
## 158  158                          Vanuatu   2,631
## 159  159                             Chad   2,580
## 160  160                          Senegal   2,578
## 161  161                            Yemen   2,521
## 162  162                            Nepal   2,481
## 163  163                             Mali   2,265
## 164  164                            Benin   2,185
## 165  165                           Uganda   2,067
## 166  166                  Solomon Islands   1,996
## 167  167                      Afghanistan   1,957
## 168  168                         Zimbabwe   1,953
## 169  169                         Ethiopia   1,916
## 170  170                           Rwanda   1,905
## 171  171                         Kiribati   1,821
## 172  172                     Burkina Faso   1,791
## 173  173                            Haiti   1,784
## 174  174                      South Sudan   1,671
## 175  175                       The Gambia   1,664
## 176  176                     Sierra Leone   1,652
## 177  177                    Guinea-Bissau   1,568
## 178  178                             Togo   1,546
## 179  179                          Comoros   1,529
## 180  180                       Madagascar   1,505
## 181  181                          Eritrea   1,322
## 182  182                           Guinea   1,271
## 183  183                       Mozambique   1,228
## 184  184                           Malawi   1,139
## 185  185                            Niger   1,114
## 186  186                          Liberia     882
## 187  187                          Burundi     818
## 188  188 Democratic Republic of the Congo     785
## 189  189         Central African Republic     656

Key functions: html_attrs

  • html_attrs(x) - extracts all attribute elements from a nodeset x
  • html_attr(x, name) - extracts the name attribute from all elements in nodeset x
  • Attributes are things in the HTML like href, title, class, style, etc.
  • Use these functions to find and extract your data
myhtml %>% 
  html_nodes("table") %>% extract2(2) %>%
  html_attrs()
##                                                  class 
##                                   "wikitable sortable" 
##                                                  style 
## "margin-left:auto;margin-right:auto;text-align: right"
myhtml %>% 
  html_nodes("p") %>% html_nodes("a") %>%
  html_attr("href")
##  [1] "/wiki/Purchasing_power_parity"                                                                 
##  [2] "/wiki/Goods_and_services"                                                                      
##  [3] "/wiki/Gross_domestic_product"                                                                  
##  [4] "/wiki/Per_capita"                                                                              
##  [5] "/wiki/International_Monetary_Fund"                                                             
##  [6] "/wiki/World_Bank"                                                                              
##  [7] "/wiki/National_wealth"                                                                         
##  [8] "/wiki/Savings"                                                                                 
##  [9] "/wiki/Cost_of_living"                                                                          
## [10] "/wiki/List_of_countries_by_GDP_(nominal)_per_capita"                                           
## [11] "https://en.wiktionary.org/wiki/generalized"                                                    
## [12] "/wiki/Living_standards"                                                                        
## [13] "/wiki/Inflation_rates"                                                                         
## [14] "/wiki/Exchange_rates"                                                                          
## [15] "#cite_note-2"                                                                                  
## [16] "#cite_note-3"                                                                                  
## [17] "/wiki/Personal_income"                                                                         
## [18] "/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_Wealth_distribution_and_externalities"
## [19] "/wiki/Economy"                                                                                 
## [20] "/wiki/Sovereign_state"                                                                         
## [21] "/wiki/Dependent_territories"                                                                   
## [22] "/wiki/Geary%E2%80%93Khamis_dollar"

Other functions

  • html_children - list the “children” of the HTML page. Can be chained like html_nodes
  • html_name - gives the tags of a nodeset. Use in a chain with html_children
myhtml %>% 
  html_children() %>% 
  html_name()
## [1] "head" "body"
  • html_form - parses HTML forms (checkboxes, fill-in-the-blank fields, etc.)
  • html_session - simulates a session in an HTML browser; use jump_to() and back() to navigate between pages
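A short sketch of a browser-like session (requires a live internet connection; the Wikipedia URLs are just for illustration):

```r
library(rvest)

s <- html_session("https://en.wikipedia.org/wiki/Moonlight_(2016_film)")
s %>% html_nodes("h1") %>% html_text()       # scrape within the session

s2 <- s %>% jump_to("/wiki/Barry_Jenkins")   # follow a relative link
s2 %>% back()                                # navigate back a page
```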

Your Turn #2

Find another website you want to scrape (ideas: all bills in the house so far this year, video game reviews, anything Wikipedia) and use at least 3 different rvest functions in a chain to extract some data.

Advanced Examples: Into the Weeds

Example #1: Inaugural Addresses

The Data

  • The Avalon Project has most of the U.S. Presidential inaugural addresses.
  • Obama 2013, Trump 2017, Van Buren 1837, Buchanan 1857, Garfield 1881, and Coolidge 1925 are missing, but they are easily found elsewhere. I have them saved as text files on GitHub
  • Let’s scrape all of them from The Avalon Project!

Get data frame of addresses

  • We could use another source for the President names and inauguration years, but we’ll use The Avalon Project’s site because it’s a good example of data that needs tidying.
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs," some are missing
all_inaugs <- url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE, header = TRUE) %>% 
  extract2(3)
# table of addresses
all_inaugs_tidy <- all_inaugs %>% 
  gather(term, year, -President) %>% 
  filter(!is.na(year)) %>% 
  select(-term) %>% 
  arrange(year)
head(all_inaugs_tidy)
##           President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3        John Adams 1797
## 4  Thomas Jefferson 1801
## 5  Thomas Jefferson 1805
## 6     James Madison 1809

Automate scraping

  • A function to read the addresses and get the text of the speeches, with a catch for a read error
get_inaugurations <- function(url){
  test <- try(read_html(url), silent = TRUE)
  if ("try-error" %in% class(test)) {
    return(NA)  # catch pages that fail to load
  } else {
    address <- test %>%
      html_nodes("p") %>%
      html_text()
    return(unlist(address))
  }
}

# takes about 30 secs to run
# (assumes all_inaugs_tidy also has a url column pointing to each address page)
all_inaugs_text <- all_inaugs_tidy %>% 
  mutate(address_text = map(url, get_inaugurations))

all_inaugs_text$address_text[[1]]
## [1] " Fellow-Citizens of the Senate and of the House of Representatives: "
## [2] "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated. 
## "
## [3] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow- citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence. 
## "
## [4] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
## I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people. "
## [5] "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted. 
## "
## [6] "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                
## [7] "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "

Add Missings

all_inaugs_text$President[is.na(all_inaugs_text$address_text)]
## [1] "Martin Van Buren"  "James Buchanan"    "James A. Garfield"
## [4] "Calvin Coolidge"
# Seven addresses need adding by hand: the four above errored during 
# scraping, and Obama's two and Trump's aren't in the scraped table at all.
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13,18,24,35)] <- list(vanburen1837,buchanan1857, garfield1881, coolidge1925)

# now let's combine them all
recents <- data.frame(President = c(rep("Barack Obama", 2), 
                                    "Donald Trump"),
                      year = c(2009, 2013, 2017), 
                      url = NA,
                      address_text = NA,
                      stringsAsFactors = FALSE)

all_inaugs_text <- rbind(all_inaugs_text, recents)
all_inaugs_text$address_text[c(56:58)] <- list(obama09, obama13, trump17)
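
With the manual fixes in place, a quick sanity check on the objects above (a sketch using base R) confirms no address is still missing:

```r
# every element of address_text should now be real text, not NA
stopifnot(!any(vapply(all_inaugs_text$address_text,
                      function(x) all(is.na(x)), logical(1))))
```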

Check-in: What did we do?

  1. We found some interesting data to scrape from the web.
  2. We used tidy tools to create tidy data:
    • A data frame of President and year. One observation per row!
    • Stored the URLs we wished to scrape alongside their data
    • Stored the scraped speech with the matching President, year, and url
  3. We used the consistent HTML structure of the urls we wanted to scrape to automate collection of web data
    • Way faster than copy-paste!
    • Though we had to do some by hand, we took advantage of the tidy data and added the missing data manually without much pain.
  4. We now have a tidy data set of Presidential inaugural addresses for text analysis!
    • Each variable forms a column
    • Each observation forms a row
    • Each type of observational unit forms a table

A (Small) Text Analysis

Now, I use the tidytext package to get the words out of each inaugural address.

# install.packages("tidytext")
library(tidytext)
all_inaugs_text %>% 
  select(-url) %>% 
  unnest() %>% 
  unnest_tokens(word, address_text) -> presidential_words
head(presidential_words)
##             President year     word
## 1   George Washington 1789   fellow
## 1.1 George Washington 1789 citizens
## 1.2 George Washington 1789       of
## 1.3 George Washington 1789      the
## 1.4 George Washington 1789   senate
## 1.5 George Washington 1789      and

Longest speeches

presidential_words %>% 
  group_by(President,year) %>% 
  summarize(num_words = n()) %>%
  arrange(desc(num_words)) -> presidential_wordtotals
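
Note that `presidential_words` still contains stop words ("the", "of", and so on), which is fine for comparing lengths. If we wanted the most characteristic words instead, a common tidytext follow-up (a sketch; the `stop_words` data frame ships with tidytext) is:

```r
library(tidytext)
library(dplyr)
presidential_words %>% 
  anti_join(stop_words, by = "word") %>% 
  count(President, year, word, sort = TRUE) %>% 
  head()
```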

Example #2: Notable Deaths

The Data

  • 2016 felt to many people like a year of loss: David Bowie, Prince, Alan Rickman, Carrie Fisher, and many other celebrities passed away that year
  • But were there really more “celebrity deaths” than any other year?
  • Wikipedia's article for each year includes a list of notable deaths; we'll look at 1987–2016.
  • We can scrape Wikipedia pages for this data.

Scraping Wikipedia

First, get all the URLs for the Wikipedia articles for the years of 1987-2016.

years <- 1987:2016
# the #Deaths anchor only marks the section of interest; read_html()
# fetches the whole page regardless of the fragment
urls <- paste0("https://en.wikipedia.org/wiki/", years, "#Deaths")

Next, create a data frame to store all of the data.

celebDeaths <- data.frame(year = years, url = urls,
                          stringsAsFactors = FALSE)

Look at the HTML

urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("h2")
## {xml_nodeset (8)}
## [1] <h2>Contents</h2>
## [2] <h2>\n<span class="mw-headline" id="Events">Events</span><span class ...
## [3] <h2>\n<span class="mw-headline" id="Births">Births</span><span class ...
## [4] <h2>\n<span class="mw-headline" id="Deaths">Deaths</span><span class ...
## [5] <h2>\n<span class="mw-headline" id="In_fiction">In fiction</span><sp ...
## [6] <h2>\n<span class="mw-headline" id="Nobel_Prizes">Nobel Prizes</span ...
## [7] <h2>\n<span class="mw-headline" id="References">References</span><sp ...
## [8] <h2>Navigation menu</h2>
urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("li")
## {xml_nodeset (1361)}
##  [1] <li><a href="/wiki/19th_century" title="19th century">19th century< ...
##  [2] <li><b><a href="/wiki/20th_century" title="20th century">20th centu ...
##  [3] <li><a href="/wiki/21st_century" title="21st century">21st century< ...
##  [4] <li><a href="/wiki/1960s" title="1960s">1960s</a></li>
##  [5] <li><a href="/wiki/1970s" title="1970s">1970s</a></li>
##  [6] <li><b><a href="/wiki/1980s" title="1980s">1980s</a></b></li>
##  [7] <li><a href="/wiki/1990s" title="1990s">1990s</a></li>
##  [8] <li><a href="/wiki/2000s_(decade)" title="2000s (decade)">2000s</a> ...
##  [9] <li><a href="/wiki/1984" title="1984">1984</a></li>
## [10] <li><a href="/wiki/1985" title="1985">1985</a></li>
## [11] <li><a href="/wiki/1986" title="1986">1986</a></li>
## [12] <li><b><strong class="selflink">1987</strong></b></li>
## [13] <li><a href="/wiki/1988" title="1988">1988</a></li>
## [14] <li><a href="/wiki/1989" title="1989">1989</a></li>
## [15] <li><a href="/wiki/1990" title="1990">1990</a></li>
## [16] <li><a href="/wiki/1987_in_archaeology" title="1987 in archaeology" ...
## [17] <li><a href="/wiki/1987_in_architecture" title="1987 in architectur ...
## [18] <li><a href="/wiki/1987_in_art" title="1987 in art">Art</a></li>
## [19] <li><a href="/wiki/1987_in_aviation" title="1987 in aviation">Aviat ...
## [20] <li><a href="/wiki/Category:1987_awards" title="Category:1987 award ...
## ...

Start Scraping

  • Write a function for scraping all the years, just like with the Presidents’ inaugural addresses
  • Unfortunately, the lists aren’t as structured as the Wikipedia table
  • This creates some difficulties…
  • But, luckily, the same exact difficulties exist on each page, so we only have to deal with them once!

Write the function (1/2)

  • Heads up - this is a difficult example. Don’t worry if you don’t understand everything right away
  • Also, this is just one of many possible solutions to this problem
get_deaths <- function(url){
  # get the main content page
  page <- url %>% read_html() %>% 
    html_nodes("#mw-content-text") %>% html_children()
  # get the names of all elements 
  tagnames <- page %>% html_name()
  # where are the big section headers
  h2s <- which(tagnames == "h2")
  # to find the heading labeled "Deaths"
  h2childids <- page[h2s] %>% html_children() %>% html_attr("id")
  idDeaths <- which(h2childids == "Deaths")
  # the list of deaths starts right after deathStart and ends 
  # immediately before deathEnd (the next big header); each <h2> 
  # here has two child <span>s, so child index idDeaths maps back 
  # to h2 number (idDeaths+1)/2
  deathStart <- h2s[(idDeaths+1)/2]
  deathEnd <- h2s[(idDeaths+1)/2+1]
  # get the deaths
  death_elements <- page[(deathStart+1):(deathEnd-1)] 
  deaths <- death_elements %>% html_nodes("li") %>% html_text()

(continued on next slide)

Write the function (2/2)

# the deaths appear in two list formats: (a) a single death that day,
  # written "Date – Name, Description" on one line
  deathsa <- data.frame(death = deaths[grep("–", deaths)], stringsAsFactors = FALSE)
  deathsa <- deathsa %>% 
    separate(death, into = c("Date", "Person"), sep = " – ") %>% 
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge") 
  # or (b) multiple deaths that day, listed as nested sub-items
  deathsb <- data.frame(death = deaths[-grep("–", deaths)], stringsAsFactors = F)
  # remove repeats
  deathsb <- data.frame(death = deathsb[grep("\n",deathsb$death),], stringsAsFactors = F)
  # tidy up the data
  deathsb %>% 
    separate(death, into = c("Date", "Other"), sep = "\\n", extra="merge") %>%
    separate(Other, into = paste0("Person", 1:6), sep = "\\n", fill = "right") %>% 
    gather(Person, Desc, -Date) %>% 
    select(Date, Desc) %>%
    filter(!is.na(Desc)) -> deathsb
  deathsb %>% separate(Desc, into = c("Name", "Desc"), sep = ", ", extra = "merge") %>%
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    filter(!is.na(Desc)) -> deathsb
  #combine the 2 sets
  deaths <- rbind(deathsa, deathsb)

  return(deaths)
} 
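
Before mapping over all 30 years, it can save debugging time to spot-check the function on a single page (a sketch using the `urls` vector defined earlier):

```r
# if the structural assumptions break, one page is easier to debug
test1987 <- get_deaths(urls[1])
str(test1987)
```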

Use the function!

  • Use the same tidy principles we used for the inaugural example.
# should take about 10 seconds
celebDeaths <- celebDeaths %>% 
  mutate(Deaths = map(url, get_deaths)) %>%
  unnest()
head(celebDeaths[,-2])
##   year   Month Day           Name
## 1 1987 January   6 Harry D. Payne
## 2 1987 January   9    Arthur Lake
## 3 1987 January  10  Hakan Malmrot
## 4 1987 January  14   Douglas Sirk
## 5 1987 January  15     Ray Bolger
## 6 1987 January  19  Gerald Brenan
##                                                                                                                                              Desc
## 1                                                                                                                    American architect (b. 1891)
## 2                                                                                           American actor, Dagwood Bumstead in Blondie (b. 1905)
## 3                                                                                                                       Swedish swimmer (b. 1900)
## 4 German-born film director, Hollywood melodramas Magnificent Obsession, All That Heaven Allows, Written on the Wind, Imitation of Life (b. 1897)
## 5                                                                     American actor, singer, and dancer. Scarecrow in The Wizard of Oz (b. 1904)
## 6                                                                                                          British writer and Hispanist (b. 1894)

Check-in: What did we do?

  1. We found some interesting data to scrape from the web.
  2. We used tidy tools to create tidy data:
    • Years and Wikipedia pages associated with them
    • Stored the scraped data with the matching year and URL
  3. We spent some time decoding the HTML & figuring out how to find where our data was stored
    • Struggled with lack of structure in the lists we wanted
    • Not a unique solution
  4. Wrote a function to scrape a page; applied it to each year in our data
  5. Output: A tidy data frame of one person per row with dates, names, and descriptions

A (Small) Data Analysis

  • We want to know whether 2016 really did stand out as a year of celebrity deaths
  • Let’s get a quick count
celebDeaths %>% 
  group_by(year) %>% 
  summarise(num_deaths = n()) %>% 
  arrange(desc(num_deaths)) %>% 
  head(10)
## # A tibble: 10 × 2
##     year num_deaths
##    <int>      <int>
## 1   2016        358
## 2   1993        314
## 3   2015        309
## 4   1990        305
## 5   1991        294
## 6   1992        286
## 7   1989        266
## 8   1996        265
## 9   1995        249
## 10  1998        248

Over time?

  • Some people have postulated that there is an increase in deaths because we are 50+ years out from the cultural revolution of the 1960s.
  • Let’s see if there’s a trend over time:
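
The plot itself might look something like this (a sketch, reusing the count from the previous slide):

```r
library(dplyr)
library(ggplot2)
celebDeaths %>% 
  group_by(year) %>% 
  summarise(num_deaths = n()) %>% 
  ggplot(aes(x = year, y = num_deaths)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Notable deaths on Wikipedia")
```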

Conclusion

What did we do?

  • Learned about webscraping and why you’d want to do it
  • Saw some resources for webscraping in R
  • Got to know the tidyverse
  • Scraped data from the web with rvest
  • Discovered the longest inaugural address given by a US President was over 8,000 words
  • Found out that 2016 really was a major year in celebrity deaths
  • Had fun!

Thank you!

  • Questions? We have the room until 6pm!